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ABSTRACT 

MicroRNAs (miRNAs) are small RNAs ~22 nt in 
length that are involved in the regulation of a 
variety of physiological and pathological processes. 
Advances in high-throughput small RNA sequencing 
(smRNA-seq), one of the next-generation sequen- 
cing applications, have reshaped the miRNA 
research landscape. In this study, we established 
an integrative database, the YM500 (http://ngs.ym. 
edu.tw/ym500/), containing analysis pipelines and 
analysis results for 609 human and mice 
smRNA-seq results, including public data from the 
Gene Expression Omnibus (GEO) and some private 
sources. YM500 collects analysis results for miRNA 
quantification, for isomiR identification (incl. RNA 
editing), for arm switching discovery, and, more 
importantly, for novel miRNA predictions. Wetlab 
validation on >100 miRNAs confirmed high correl- 
ation between miRNA profiling and RT-qPCR results 
(ff = 0.84). This database allows researchers to 
search these four different types of analysis 
results via our interactive web interface. YM500 
allows researchers to define the criteria of 
isomiRs, and also integrates the information of 
dbSNP to help researchers distinguish isomiRs 
from SNPs. A user-friendly interface is provided to 
integrate miRNA-related information and existing 
evidence from hundreds of sequencing datasets. 
The identified novel miRNAs and isomiRs hold the 
potential for both basic research and biotech 
applications. 



INTRODUCTION 

MicroRNAs (miRNAs) are a family of small RNAs, 
~22 nt in length, that act as key post-transcriptional regu- 
lators of gene expression, modulating the translational ef- 
ficiency and/or the stability of target mRNAs. Small RNA 
sequencing (smRNA-seq), one of the next-generation 
sequencing (NGS) applications, enables detection and 
profiling of miRNAs with particularly high levels of sen- 
sitivity and accuracy (1). Furthermore, smRNA-seq 
allows discovery of previously uncharacterized miRNA 
species and has revealed unexpected complexity among 
miRNAs. These smRNA-seq discoveries include not 
only novel miRNAs but also a series of miRNA 
variants, termed isomiRs (2). In recent years, smRNA- 
seq datasets have grown rapidly and have been deposited 
in public databases such as the Gene Expression Omnibus 
(GEO) (3) and ArrayExpress (4). Exploration of these 
massive datasets, however, remains a daunting challenge, 
and an integrative meta-analysis of all smRNA-seq 
datasets has not yet been well performed. 

For the detection and identification of novel miRNAs, 
smRNA-seq is very promising because it is not as 
time-consuming and expensive as small RNA cloning 
methods (5). Many sequencing software tools have been 
developed to identify novel miRNAs. Li et al. evaluated 
eight software tools, namely miRDeep (6), miRanalyzer 
(7), miRTRAP (8), MIReNa (9), mirTools (10), DSAP 
(11), miRNAkey (12) and mireap (13), based on their 
common features and key algorithms, and recommended 
the best tools in predicting novel miRNAs for different 
data types (5). 

IsomiRs are commonly reported in deep sequencing 
studies (14-27) and are unlikely to be due simply to deg- 
radation or sequencing errors (25,28). These variants have 
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been reported to be biologically relevant and functionally 
cooperative partners of canonical miRNAs (14,21). The 
variations present in IsomiRs can be grouped into three 
types: editing (nucleotide substitution), trimming and 
addition (29). The latter two types cause 5' and 3' 
end-length heterogeneity of miRNAs. Editing is a conse- 
quence of adenosine or cytidine deaminase activities and 
causes nucleotide changes at different positions of the 
mature miRNAs (15,16,28,30-33). It has previously been 
shown that several miRNAs, edited in the seed sequence 
and with an increased level of editing throughout devel- 
opment, result in diversifying target recognition (16). 
Trimming results in the shorter mature miRNAs 
compared with the canonical ones. The 3'-to-5' 
exoribonuclease Nibbler has also been reported to 
control 3' end processing of miRNAs in Drosophila 
(34,35). Non-template nucleotide additions at the 3' end 
of miRNAs have been reported as the common form of 
miRNA enzymatic modification (18,21,28) and can influ- 
ence miRNA stability (36) and the efficiency of target re- 
pression (37). It has further been revealed that the 
frequency of 3' addition to specific miRNAs changes 
with differentiation of human embryonic stem cells (18). 
Several enzymes, such as MTPAP, PAPD4, PAPD5, 
ZCCHC6, ZCCHC11 and TUT1, have been reported to 
govern 3' nucleotide addition to miRNAs (18,36-38). 

The miR/miR* nomenclature has been used to repre- 
sent the dominant and minor mature products of precur- 
sor miRNAs. However, several studies have reported that 
the arm that makes the dominant product can change in 
different tissues, stages and species (14,39-43). Such 
changes have been called 'arm switching' and are likely 
to be general (39). For example, Grimson et al. have 
reported an instance of developmental arm switching 
between the embryonic and adult stages of sponges for 
miR-2015 (43). Cloonan et al. also listed several 
miRNAs whose dominant arm changes over different 
human tissues (14). Therefore, in release 19 of the 
miRBase database (miRBase R19), the nomenclature for 
mature miRNAs now designates them as -5p and -3p, 
rather than miR/miR*, in all species. 

In this study, we present the YM500 database, which 
includes pipelines for miRNA quantification, isomiR iden- 
tification, arm switching discovery and novel miRNA 
prediction from smRNA-Seq data. YM500 contains the 
results of meta-analysis from hundreds of public smRNA- 
Seq datasets and dozens of in-house ones. YM500 aims to 
provide researchers with integrated miRNA-related infor- 
mation with various graphical visualization pages from 
hundreds of sequencing datasets via a user-friendly web 
interface. 



MATERIALS AND METHODS 

Data collection and pre-processing 

As shown under Pre-processing in Figure 1, there are 609 
Illumina smRNA-Seq datasets, including 468 human and 
141 mouse ones, in YM500. 34 out of 609 are in-house 
datasets, and the others are from the GEO public reposi- 
tory. All in-house data were deposited in the GEO 



database with an accession number of GSE39841. 
Detailed information for the datasets is provided in 
Supplementary Table SI. 3' adaptors of the FASTQ raw 
data were trimmed, and the trimmed data were collapsed 
by the FASTX toolkit (http://hannonlab.cshl.edu/fastx_ 
toolkit). A QC report, including length distribution, box 
plot of phred quality scores, etc., was then generated. For 
the public datasets containing only sequences and read 
counts, we transformed them into FASTA format and 
examined their length distribution. It has been suggested 
that if the smRNA-seq reads are abundant in 19~24nt, 
they are good for miRNA analysis (44). Thus, datasets 
which did not fit this criterion were discarded. For all 
datasets, reads shorter than 17nt or longer than 30 nt 
were filtered out. 

Prediction of novel miRNAs and their downstream targets 

It has been noted that there is no evidence that the 
miRNAs may represent fragments of mRNAs or other 
known RNA types (45). As shown in the Novel miRNA 
Prediction module in Figure 1, before predicting 
novel miRNAs, we used Bowtie (46) with options: -v 0 - 
f -norc to filter out reads that map to known miRNA 
precursors (miRBase) (45), other functional RNAs 
(Rfam) (47) and mRNAs (RefSeq) (48). Then, we 
adopted three prediction tools, namely miRDeep2, 
miReap and miRanalyzer, for predicting novel miRNAs. 
All of the prediction results were merged and unified to 
remove redundant records. According to our experience, 
there are still some predicted novel miRNAs that were 
mapped to known transcripts. We further applied 
BLAST to remove reads which were fully mapped to 
RefSeq or Rfam with identity >90%. To get a more 
reliable result, we also used miReNa as a second filter to 
filter out those that do not satisfy numerical Criteria I-V 
to describe a pre-miRNA in miReNa. About two-thirds of 
the putative miRNAs could not fulfil miReNa criteria and 
were filtered out. Finally, we filtered out the putative 
miRNAs located in exon regions (defined by RefSeq). 
These nitrations aimed to reduce the false positive rate. 
Filtration results are shown in Supplementary Figure SI. 
Most of the putative novel miRNAs that were predicted 
by multiple algorithms were preserved. There are 90 and 
637 putative novel miRNAs predicted by three algorithms 
and any two algorithms, respectively (Supplementary 
Figure SI). Second structures of the stem-loop miRNA 
precursors were predicted by RNAfold (49). 
Furthermore, the target genes of these putative miRNAs 
were predicted by two well-known algorithms: TargetScan 
(50) and miRanDa (51). All information was stored in a 
local MySQL server. 

miRNA quantification, isomiR identification and arm 
switching discovery 

As shown in Figure 1, predicted novel miRNAs were 
combined with known miRNAs from miRBase R19. All 
pre-processed datasets were mapped to the combined 
miRNA list using Bowtie with options: -a -v 1 -S -f 
-norc, and then alignment results were produced in a 
BAM file format by SAMtools (52). The BAM files were 
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Figure 1. Schematic representation of data processing. 



processed by in-house JAVA software for miRNA quan- 
tification and isomiR identification. These analysis results 
were then stored in a local MySQL database. The arm 
switching events between two groups of samples were 



determined by two criteria: one was that the averaged 
RPKM (Reads Per Kilobase of transcript per Million 
mapped reads) value of a miRNA must be larger than 
100. The second was that the ratios of two arms (5p/3p) 
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in two groups must be significantly different according to 
the Student T-test with a P-value < 0.05 performed with 
the R language. 

Validation of miRNA expression via stem-loop 
real-time PCR 

Quantification of mature miRNAs was validated by a 
stem-loop real-time RT-qPCR system performed as previ- 
ously described (53). Samples ranging from 100 ng to 1 ug 
of total RNA were used to perform reverse transcription 
(RT) using the RevertAid Reverse transcriptase kit 
(K1622; Fermentas, Glen Burnie, Maryland, USA) as 
directed by the manufacturer. Real-time PCR reactions 
were performed using Maxima™ SYBR Green qPCR 
Master Mix (K0222; Fermentas, Glen Burnie, Maryland, 
USA), and the specific products were detected and 
analysed using the StepOne™ sequence detector 
(Applied Biosystems, USA). Primers were designed on 
the basis of the sequenced miRNAs by using FastPCR 
(54). The miRNA expression data were normalized 
against U6 small nuclear RNA. 

WEB INTERFACE 

YM500 provides four interactive query interfaces 
(Expression, Novel miRNAs, isomiRs and arm- switching) 
and various graphical visualization pages to present the 
analysis results of hundreds of smRNA-Seq datasets. 

Expression 

YM500 allows expression visualization according to a 
user's customized selections. This feature helps users to 
select miRNAs according to ID lists or miRNA cluster 
definitions (Supplementary Figure S2A). For a query re- 
garding a single miRNA, we provide the histogram ex- 
pression in all samples and the expression profiles by 
tissue type of the specific miRNA. For a query regarding 
multiple miRNAs, samples can be selected according to 
the annotation (Supplementary Figure S2B). A recheck 
page (Supplementary Figure S2C) helps users to select 
the specific miRNA and samples for heatmap visualiza- 
tion (Figure 2A). A download link for the normalized ex- 
pression data of selected samples and miRNAs is also 
provided for further analysis. Differential expression of 
114 miRNAs was confirmed by RT-qPCR. The 
Pearson's correlation coefficient, R, between NGS 
analysis and RT-qPCR results was 0.84 (Figure 2B). 

Novel miRNAs 

Whenever researchers identify potentially novel miRNAs, 
they can search for existing evidence of the novel miRNAs 
among the hundreds of samples in YM500. Novel 
miRNAs can be searched for according to the exact 
sequence of a mature miRNA or genomic location. 
Figure 3A illustrates the provided mature and precursor 
miRNA information, including the prediction algorithms, 
the predicted target genes, the expression profiles, the 
RNA secondary structures and the hyperlinks to three 
commonly used genome browsers. A density plot of 
reads that mapped to the stem-loop precursor indicates 



the percentage of reads overlapping the loci (Figure 3B). 
This provides an overview of length heterogeneity for a 
miRNA. Figure 3C shows a view of the deep sequencing 
reads and illustrates all sequences that map to the same 
novel miRNA. In the same figure, we also provide read 
numbers and numbers of datasets/samples in which a 
novel miRNA sequence was found. Reads can be filtered 
according to the number of mismatches to the hairpin 
sequence, the read count and the number of datasets/ 
samples. Furthermore, each specific sequence (for 
example, the sequence in the rectangle in Figure 3C) has 
a hyperlink to a page (shown in Figure 3D) which contains 
the detailed information for the sequence, including the 
expression histogram, the raw counts and the RPM 
(Reads Per Million mapped reads) in each sample. 

IsomiRs 

This section helps researchers to find the isomiRs of the 
known miRNAs in miRBase. As shown in Figure 4A, 
users can define the criteria of isomiRs according to 
number of mismatch, number of read counts, number of 
expressed samples and isomiR types (trimming or addition 
at 5'/3' end). Figure 4B illustrates the information of 
edited sites, which are determined by the editing rate in 
Figure 4A. Editing information is also compared to 
dbSNP (Build 135). As shown in Figure 4B, both editing 
and an SNP were found at the 23rd base of 
hsa-miR-21 l-5p. However, the type of nucleotide alter- 
ation is different, indicating that such nucleotide substitu- 
tion is a putative editing event rather than a common 
variant. All of the isomiRs defined in Figure 4A are 
detailed in Figure 4C, which indicates the mismatch 
sites, number of reads and number of samples containing 
the isomiR. Similar to the view of NGS data in the Novel 
miRNAs part (Figure 3C), each sequence in Figure 4C has 
a hyperlink to a page, shown in Figure 3D, which sum- 
marizes the details of the specific isomiR. Figure 4D shows 
the summary information of all isomiRs of a specific 
mature miRNA by a sequence logo format (where the 
height of each character is proportional to the total read 
counts of a miRNA). Figure 4E is similar to Figure 4D, 
but the height of each character is normalized to the read 
counts in each base and indicates the editing rate in each 
base. Similar to Figure 3B, Figure 4F is a density plot 
showing the distribution of reads in a mature miRNA 
and illustrates the length heterogeneity. 

Arm switching 

As far as arm switching is concerned, YM500 provides 
two ways to investigate this phenomenon. YM500 allows 
users to select a specific precursor miRNA and profiles the 
expression of two arms between samples and tissues 
(Figure 5A). This helps researcher to quickly view arm 
switching events in a specific miRNA species. Another 
method of illustration allows users to select two groups 
of samples, according to annotations for the database, and 
YM500 will identify precursor miRNAs whose dominant 
expression switches from one arm to the other between the 
two groups (Figure 5B). 
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Figure 2. miRNA expression. (A) The heatmap visualization for the expression of the selected miRNAs across samples. (B) The comparison of NGS 
and RT-qPCR in quantification of mature miRNAs. 



DISCUSSION 

YM500 integrates the analysis results of miRNA profiling/ 
quantification, isomiR detection, novel miRNA prediction 
and arm switching identification. The reliability of in silico 
data were tested via experimental validation on in-house 
samples or by paper survey. For quantification, our 
miRNA profiling results were highly consistent to those 
of RT-qPCR, with a Person's correlation coefficient 0.84 
(Figure 2D). For novel miRNA discovery, we used 
multi-algorithms to explore as many putative miRNAs 
as possible and adopted several filtration steps to reduce 
false positive discoveries. The definition of 'novel miRNA' 
in YM500 is with respect to miRBase and we are collect- 
ing the information of 'novel miRNA' claimed by other 
references as another source of evidences but this task is 
undergoing. A dozen of miRNAs have been validated 
in our lab for their existence and expression patterns 
by RT-qPCR (Supplementary Table S2). Using 
'NM_hsa_1300' as an example (Figure 3), wetlab 
evidence such as the RT-qPCR melting curve proved its 
existence (Supplementary Figure S3). However, RT-qPCR 
can prove the 'existence' of putative miRNAs. We use 
Agol/2-mediated RNA-immunoprecipitation (RNA-IP) 
plus further sequencing as another line of evidence, 
and found that >1000 novel miRNAs in YM500 are 
indeed associated with the RNA-induced silencing 
complex (RISC) (unpublished data). For isomiRs, a 
U-to-G substitution in the ninth base of mmu-let-7a-5p 
(32) is discovered in YM500. Such substitution events 
(putative editing event) may result in a significant 
increase in stability of down-regulated targets (32). The 
second example is that we also discovered three reported 
A-to-I putative editing events in three miRNAs, which 
were the 7th, 8th and 9th bases of the mature product, 
and edited in 25%, 18% and 11% of the reads in (55), 
indicating that additional variability is tolerated in the 
functionally important seed region. 

As for an example of arm switching, for hsa-mir-154 
(Figure 5A), Cloonan et al. have reported that the 



expression of the two arms would be switched in different 
tissues (14). They demonstrated the switch between ex- 
pression dominance from the 3p arm (ovary) to the 5p 
arm (brain and placenta). It has been shown that alterna- 
tive mature miRNAs produced from the same precursor 
have different targeting properties and therefore different 
biological functions (56). Hence, the changes in arm 
choice of hsa-mir-154 might have significant functional 
consequences. The expression of a specific arm may 
dominate some tissue-specific functions. Besides, 
hsa-mir-144 has been reported as a cancer/disease 
marker in several studies (57-59) but our arm switching 
results show that it might have significant tissue- 
specificity. It may need to be further investigated the 
mechanisms of hsa-mir-144 which are related to cancer 
and tissue-specificity. 

The advantage of using smRNA-Seq for miRNA 
researches is that smRNA-Seq can discover novel 
miRNAs and isomiRs. At this point, the number and func- 
tions of isomiRs remain unclear, but it has been reported 
that they might be quite prevalent in creatures. YM500 
allows researchers to define the criteria of isomiRs, as 
well as providing existing evidence for various isomiRs 
from hundreds of smRNA-Seq datasets. A representative 
example of expression patterns and raw read counts in 
each smRNA-seq dataset is shown in Figure 4. We 
defined a candidate isomiR by the criteria shown in 
Figure 4A (allow one mismatch; the read count and 
sample count are at least 100 and 5, respectively). At the 
same time, the editing event is defined by the editing rate at 
least 1% of the total reads (pass the criteria) mapped to 
hsa-miR-211-5p. (Please note that all of criteria could also 
be customer-defined). In this case, 13 sites of the canonical 
miRNAs have isomiRs with mismatch, but there is only 
one putative editing event in the 23rd base with 834 sup- 
porting reads in more than a dozen of datasets (Figure 4B 
and C). Besides, there are two distinct isomiRs that have 
the same editing site (the rectangle in Figure 4C). We also 
provide the detail information of each isomiR via 
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Figure 3. Novel miRNA information. (A) The information for a novel mature miRNA (NM_hsa_1300; NM: novel mature miRNAs in YM500) and 
its precursor miRNA (NP_hsa_17866 and _683; NP: novel precursor miRNAs in YM500). (B) A density plot illustrating reads mapping to the 
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hyperlinks on web interface. These isomiRs still need ex- 
tensive wetlab validations to prove their existence and 
functionality. Especially, the editing events need genomic 
evidences from the same sample to rule out novel SNPs or 
mutations. However, YM500 is a screening tool to help 
researchers reduce the numbers of candidate isomiRs and 



could be severed as a line of evidences for their existence. 
We believe these customer-defined criteria and selections 
would help researchers to exclude most sequencing errors 
and other artifacts. 

The same holds true for novel miRNAs. When a small 
RNA is consistently detected in various datasets, it does 
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constitute autonomous evidence that it is prevalent and 
thus the likely result of a specific biogenesis (6). As 
shown in Figure 3, for a putative novel miRNA, YM500 
provides the expression profiles, the prediction algorithms, 
the sequences, the counts mapping to the miRNA, the 
secondary structure, etc. This information helps re- 
searchers to evaluate the prediction result. For example, 
according to the suggestions of Kozomara and 
Griffiths-Jones (45), the pattern of reads focusing on the 



mature region (Figure 3B) supports a high-confidence 
miRNA annotation with multiple reads (10-20 as 
cutoffs) from independent datasets. In contrast, the 
pattern of reads overlapping the sequence of a putative 
miRNA does not support the annotation of a miRNA, 
with multiple offset reads distributed across the locus 
(45). Besides, most reads mapping to a given mature 
miRNA annotation should have the same 5' end 
whereas the 3' end may be significantly more variable 
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Arm switching. (A) The expression profiles of two arms of hsa-mir-154 between samples and tissues. (B) The precursor miRNAs with arm 
event, identified by YM500, between two groups of customer-defined samples. 



(45). YM500 helps researchers to check these characteris- 
tics via the presentation page (Figure 3). For isomiRs and 
novel miRNAs, YM500 provides existing evidence from 
hundreds of smRNA-Seq datasets. Whenever researchers 
find interesting results (such as some specific isomiRs and 
novel miRNAs) in their own datasets, they can validate 
their results in YM500. 

In comparison to other databases related to 
smRNA-seq, including miRBase, deepBase (60) and the 
isomiRs database of Lee et al. (24), YM500 analyse 
miRNA-related information from many dimensions, and 
datasets included in meta-analysis are more abundant. 
The deepBase database contains 185 smRNA-Seq 
datasets for seven organisms and is also a platform for 
annotating and discovering small and long non-coding 
RNAs (miRNAs, siRNAs, piRNAs, etc.) from next-gen- 
eration sequencing data. The isomiRs database of Lee 
et al. contains only 18 smRNA-Seq datasets for two or- 
ganisms and lists only the isomiR information in a single 
sample per query. Although miRBase covers much more 
species, miRBase does not include expression profiles of 
known miRNAs, RNA editing, arm switching or miRNA 
prediction results. For novel miRNA, YM500 adopts four 
different algorithms and provides target prediction, ex- 
pression profiles of novel miRNA and hyperlinks to 
genome browsers. deepBase lists only the results of 
novel miRNA predicted by miRDeep, and no other add- 
itional information is provided. Besides, there are only 79, 
9 and 5 human smRNA-seq datasets in miRBase, 
deepBase and the isomiRs database of Lee et al., respect- 
ively. YM500 contains 468 human datasets (from public 
databases or in-house), which is the most comprehensive 
collection so far. Another unique part of YM500 is that 



our interactive web interface and customer-defined criteria 
also help researchers to retrieve these four types of 
analysis results. Finally, to our best knowledge, there is 
no other resource providing arm switching information. 
Comparing with these databases, YM500 provides a 
flexible web interface, more enhanced resolution and 
novel findings owing to the integrated pipelines for 
miRNA (including miRNA expression, IsomiRs, novel 
miRNAs and arm-switching) and the large number of 
smRNA-Seq datasets of various tissue/cell types. 
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